Abstract

Domain adaptation aims at training a classifier in one dataset and applying it to a related but not identical dataset. One 
successfully used framework of domain adaptation is to learn a transformation to match both the distribution of the features 
(marginal distribution), and the distribution of the labels given features (conditional distribution). In this paper, we propose 
a new domain adaptation framework named Deep Transfer Network (DTN), where the highly flexible deep neural networks 
are used to implement such a distribution matching process. 

This is achieved by two types of layers in DTN: the shared feature extraction layers, which learn a shared feature subspace
in which the marginal distributions of the source and the target samples are drawn close, and the discrimination layers, which
match conditional distributions by classifier transduction. We also show that DTN has a computational complexity linear
in the number of training samples, making it suitable for large-scale problems. By combining the best paradigms of both
worlds (deep neural networks in recognition, and matching marginal and conditional distributions in domain adaptation),
we demonstrate through extensive experiments that DTN improves significantly over previous methods in both execution time and
classification accuracy.


1. Introduction 

Conventional machine learning requires a large number of training samples in order to train a reliable model. In many
real-world applications, however, it is expensive and sometimes even impossible to get enough labeled training data. A
straightforward solution is to train a model on a labeled dataset that is related but not identical to the target task, and apply the
model to the data being considered. Unfortunately, the performance of the model may drop dramatically during the domain
transfer, due to different feature and label distributions 0. To solve this problem, domain adaptation, which aims at
learning from one dataset (the source dataset) and transferring the knowledge to a related but not identical dataset (the target dataset),
has become an active research topic [25]. Such methods are widely used in object classification ID, object detection
EJCH), one-shot learning [9], etc.

The key to most domain adaptation methods is to learn a transformation on the features to reduce the discrepancy of
the distributions between the source and the target datasets. There are different situations in real-world problems. 1) The
marginal distributions are different, while the conditional distributions are similar. 2) The marginal distributions are similar,
while the conditional distributions are different. 3) Both the marginal and the conditional distributions are different [30].

Instance reweighting 001 and subspace learning [12], [1], [19], [27], [24] are two typical learning strategies for domain
adaptation [22]. The former reduces the distribution discrepancy by reweighting the source samples, and trains a classifier on
the weighted source samples. The latter tries to find a shared feature space in which the distributions of the two datasets are
matched. There are also methods that perform instance reweighting and subspace learning simultaneously and achieve
state-of-the-art results on many benchmark datasets [22].

Classifier transduction represents another independent line of research for domain adaptation [20]. It directly designs an
adaptive classifier by incorporating distribution adaptation with model regularization [7], [23], [20]. A typical strategy is to use
certain regularization on the conditional distribution in the optimization process. For example, minimizing a metric between
the outputs of a classifier can reduce the discrepancy of the conditional distributions of the source and the target datasets [20],
so that an adaptive classifier can be trained.

Recently, it has been shown that in many challenging cases, both the marginal and the conditional distributions may be
different between domains. Simultaneously matching both of them can significantly improve the performance of the
final classifier [30], [22], [20]. Methods belonging to this category have achieved outstanding performance on benchmark datasets.
However, many of them have high computational complexity ($O(n^2)$ in [22] and $O(n^3)$ in [20]), which makes computation
prohibitively expensive on large-scale problems.

Deep learning approaches based on neural networks have achieved many inspiring results in machine learning and pattern
recognition ED. Recently, neural networks have been successfully used for domain adaptation in tasks such as sentiment
classification IH and pedestrian detection [29]. However, few works solve the general domain adaptation problem
by explicitly matching the distributions of datasets through neural networks. Such an explicit matching strategy has been shown
to be crucial in the state-of-the-art methods [30], [22], [20].

In this work, we propose the Deep Transfer Network (DTN), where a deep neural network is used to model and match
both the marginal and the conditional distributions. The marriage of deep neural networks and domain adaptation provides
three advantages.

• The neural network structure makes our model suitable for achieving domain transfer by simultaneously matching both the
marginal and the conditional distributions. We achieve this with two different types of layers in DTN: the shared feature
extraction layer, which learns a subspace to match the marginal distributions of the source and the target samples, and
the discrimination layer, which matches the conditional distributions by classifier transduction (Section 2).

• The proposed method can be optimized in $O(n)$ time, where $n$ is the number of training
data points. Compared to other methods ($O(n^2)$ in [22] and $O(n^3)$ in [20]), it is more suitable for
large-scale domain adaptation problems (Sections 3 and 4).

• By combining the best paradigms of both worlds (neural networks in recognition, and matching both the marginal
and the conditional distributions in domain adaptation), we show through extensive experiments that the proposed method
outperforms the state-of-the-art methods in both computation time and accuracy. To evaluate our method in
a large-scale setting, we test on two datasets that are 10 times the size of the widely used Office-Caltech
dataset. On the USPS/MNIST dataset, DTN achieves a 28.95% improvement in accuracy (Section 4).


2. Deep Transfer Network 

2.1. Problem Formulation 

Let $x \in \mathbb{R}^d$ be a $d$-dimensional column vector. $x^s \in \mathbb{R}^d$ and $x^t \in \mathbb{R}^d$ represent samples in the source and the target datasets
respectively, and $y \in \mathbb{R}$ is the corresponding label of $x$. Given a labeled source dataset $D^s = \{(x_1^s, y_1^s), \ldots, (x_{n_s}^s, y_{n_s}^s)\}$ and
an unlabeled target dataset $D^t = \{x_1^t, \ldots, x_{n_t}^t\}$, where $n_s$ and $n_t$ are the numbers of data points, the goal of domain
adaptation is to learn a statistical model using all the given data to minimize the prediction error $\| \hat{y}_i^t - y_i^t \|$ in the
target dataset, where $\hat{y}_i^t$ is the predicted label of the $i$-th target data point by the model, and $y_i^t$ is the corresponding true
label, unknown in training. We consider the case where both the marginal distributions and the conditional distributions
of the source and the target datasets are different. In the rest of this paper, $P(x^s)$ and $P(x^t)$ represent the marginal
distributions of the source and the target datasets, while $P(y^s \mid x^s)$ and $P(y^t \mid x^t)$ represent the conditional distributions.

2.2. Preliminary 


DTN applies a distribution matching strategy. In order to match the distributions, a metric of difference between
two distributions needs to be defined. We follow mi and adopt the empirical Maximum Mean Discrepancy (MMD) as the
nonparametric metric.

MMD is a commonly used metric of discrepancy between two distributions, due to its efficiency in computation and
optimization [26]. Denote $X^s = [x_1^s, \ldots, x_{n_s}^s] \in \mathbb{R}^{d \times n_s}$ and $X^t = [x_1^t, \ldots, x_{n_t}^t] \in \mathbb{R}^{d \times n_t}$ as the data matrices of $D^s$ and
$D^t$ respectively. Denote the data matrix $X = [X^s, X^t]$ as the combination of $X^s$ and $X^t$. Then the MMD between the marginal
distributions of the source and the target datasets is defined as:

$$\mathrm{MMD}_{mar} = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} x_i^s - \frac{1}{n_t} \sum_{j=1}^{n_t} x_j^t \right\|^2 = \mathrm{Tr}(X M X^T), \quad (1)$$

where $M$ is the MMD matrix. Let $M_{ij}$ be one element of $M$. $M_{ij}$ can be calculated as:

$$M_{ij} = \begin{cases} 1/(n_s)^2, & i \le n_s,\ j \le n_s \\ 1/(n_t)^2, & i > n_s,\ j > n_s \\ -1/(n_s n_t), & \text{otherwise} \end{cases} \quad (2)$$
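As a minimal sketch of the definition above, the following standard-library Python builds the MMD matrix and checks numerically that the trace form equals the squared distance between the source mean and the target mean. It assumes the usual three-case form of $M$ (entries $1/n_s^2$, $1/n_t^2$ and $-1/(n_s n_t)$); the function names and the toy data are hypothetical and chosen only for illustration.

```python
def mmd_matrix(n_s, n_t):
    """Return the (n_s + n_t) x (n_s + n_t) MMD matrix M as nested lists."""
    n = n_s + n_t
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < n_s and j < n_s:          # both indices point at source samples
                M[i][j] = 1.0 / (n_s * n_s)
            elif i >= n_s and j >= n_s:      # both indices point at target samples
                M[i][j] = 1.0 / (n_t * n_t)
            else:                            # one source sample, one target sample
                M[i][j] = -1.0 / (n_s * n_t)
    return M


def mmd_trace(samples, M):
    """Tr(X M X^T), where column i of X is samples[i]."""
    n = len(samples)
    d = len(samples[0])
    total = 0.0
    for i in range(n):
        for j in range(n):
            dot = sum(samples[i][k] * samples[j][k] for k in range(d))
            total += M[i][j] * dot
    return total


def mmd_means(source, target):
    """Squared distance between the source mean and the target mean."""
    d = len(source[0])
    mean_s = [sum(x[k] for x in source) / len(source) for k in range(d)]
    mean_t = [sum(x[k] for x in target) / len(target) for k in range(d)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))


# Hypothetical toy data: 3 source and 2 target samples in 2 dimensions.
source = [[0.0, 1.0], [2.0, 1.0], [1.0, 4.0]]
target = [[3.0, 3.0], [5.0, 1.0]]
M = mmd_matrix(len(source), len(target))
via_trace = mmd_trace(source + target, M)
via_means = mmd_means(source, target)
```

The trace form costs time quadratic in the number of samples, while the mean form needs a single pass over the data, which is the property the later complexity argument relies on.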
A typical neural network consists of two types of layers:

• Feature extraction layer, which projects the input data onto another space by linear projection (or convolution) followed
by a nonlinear activation function.

• Discrimination layer, where the final label prediction is performed.

In Convolutional Neural Networks (CNNs), for example, the feature extraction layers consist of the convolutional and
fully connected layers, and the discrimination layer consists of the softmax regression layer. In Multi-Layer Perceptrons
(MLPs) with $l$ layers (shown in Figure 1), the first $l - 1$ fully connected layers are feature extraction layers, while the last
softmax regression layer is the discrimination layer. In the following, we develop DTN based on MLPs. Our method can
be easily extended to other types of neural networks.

2.3. Matching Marginal Distributions 

MLPs define a family of functions. We consider a single-layer neural network first. The single-layer neural network maps
$x$ to a $k$-dimensional feature vector $h$ by a linear projection and a nonlinear vector-valued activation function $f(\cdot)$:

$$h = f(Wx), \quad (3)$$

where $W$ is a $k \times d$ projection matrix. Typical choices of $f(\cdot)$ include $\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$ and $\mathrm{sigmoid}(a) = \frac{1}{1 + e^{-a}}$.

Single-layer networks can be stacked to form a deep neural network, where the output of one layer is used as the
input of the next layer.

For an MLP with $l$ layers, the first $l - 1$ layers are all feature extraction layers. Let $h(l-1)$ be the feature of $x$ in the $(l-1)$-th
layer. Denote $P(h^s(l-1))$ and $P(h^t(l-1))$ as the distributions of the features of the source and the target datasets. The
goal of our method is to make $P(h^s(l-1))$ and $P(h^t(l-1))$ close. This is achieved by minimizing the MMD when
training the feature mapping function.

Let $H^s(l-1) = [h_1^s(l-1), \ldots, h_{n_s}^s(l-1)] \in \mathbb{R}^{k \times n_s}$ be the feature matrix of the source dataset, and let $H^t(l-1) =
[h_1^t(l-1), \ldots, h_{n_t}^t(l-1)] \in \mathbb{R}^{k \times n_t}$ be the feature matrix of the target dataset. $H(l-1) = [H^s(l-1), H^t(l-1)]$ is the
combination of $H^s(l-1)$ and $H^t(l-1)$. The distance between $P(h^s(l-1))$ and $P(h^t(l-1))$ is modelled by the marginal
MMD as follows:

$$\mathrm{MMD}_{mar} = \mathrm{Tr}(H(l-1)\, M\, H^T(l-1)), \quad (4)$$

where $M$ is the MMD matrix defined in Eq. 2. Later, we will show how to minimize the above.
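The following sketch illustrates the idea of this subsection on a single feature extraction layer: samples are mapped through $h = \tanh(Wx)$ and the marginal MMD is then evaluated on the hidden features as the squared distance between the two feature means. The projection matrix and the data are hypothetical values, and the mean-based evaluation is an equivalent shortcut to the trace form rather than a quotation of the authors' implementation.

```python
import math


def layer(W, x):
    """One feature extraction layer: h = tanh(W x), with W given as a list of rows."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def marginal_mmd(h_source, h_target):
    """Squared distance between mean source and mean target features (one pass over the data)."""
    m_s, m_t = mean_vector(h_source), mean_vector(h_target)
    return sum((a - b) ** 2 for a, b in zip(m_s, m_t))


# Hypothetical 3 x 2 projection matrix (k = 3 hidden units, d = 2 inputs).
W = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]
x_source = [[0.0, 1.0], [2.0, 1.0]]
x_target = [[1.0, 1.0], [1.0, 1.0]]
h_s = [layer(W, x) for x in x_source]
h_t = [layer(W, x) for x in x_target]
mmd_hidden = marginal_mmd(h_s, h_t)
```

Training the layer to shrink `mmd_hidden` changes `W` so that the source and target feature means move toward each other, which is the effect the marginal term is meant to have.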


2.4. Matching Conditional Distributions 


In this paper, we consider logistic regression as the model of the discrimination layer. It projects the input data points
onto a set of hyperplanes, and the distances to the planes reflect the posterior probabilities. Assume there are a total of $C$
categories in the dataset. For an arbitrary category $c$, the hyperplane of category $c$ is denoted as $w_c$. The posterior probability
of $y = c$ given feature $x$ can be modelled as:

$$P(y = c \mid x) = \mathrm{softmax}_c(w_c^T x). \quad (5)$$


Figure 1. An example of MLP with $l$ layers. The first $l - 1$ layers are feature extraction layers, and the last layer is the discrimination layer.

The distance between the conditional distributions of the source and the target dataset is measured by the conditional
MMD:

$$\mathrm{MMD}_{con} = \sum_{c=1}^{C} q_c^T M q_c, \quad (6)$$

where $q_c \in \mathbb{R}^{n_s + n_t}$ is the posterior distribution output vector of the $c$-th category for all the data points, and $M$ is the MMD
matrix defined in Eq. 2. The smaller the conditional MMD is, the closer the conditional distributions of the source and the
target datasets. Later, we will show how to minimize the above.
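A small sketch of the discrimination layer and of one reading of the conditional MMD follows. It assumes that, for each category $c$, the quadratic form of $q_c$ with the MMD matrix reduces to the squared gap between the mean source posterior and the mean target posterior of that category; the hyperplanes and feature vectors are hypothetical values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def posterior(hyperplanes, h):
    """P(y = c | h) for every class c; hyperplanes[c] plays the role of w_c."""
    return softmax([sum(w * v for w, v in zip(w_c, h)) for w_c in hyperplanes])


def conditional_mmd(p_source, p_target):
    """Sum over classes of the squared gap between mean source and mean target posteriors."""
    num_classes = len(p_source[0])
    total = 0.0
    for c in range(num_classes):
        mean_s = sum(p[c] for p in p_source) / len(p_source)
        mean_t = sum(p[c] for p in p_target) / len(p_target)
        total += (mean_s - mean_t) ** 2
    return total


# Hypothetical hyperplanes for C = 3 categories on 2-dimensional features.
hyperplanes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
h_source = [[2.0, 0.0], [1.5, 0.5]]
h_target = [[0.0, 2.0], [0.5, 1.5]]
p_s = [posterior(hyperplanes, h) for h in h_source]
p_t = [posterior(hyperplanes, h) for h in h_target]
mmd_con = conditional_mmd(p_s, p_t)
```

In this toy case the source features favour the first category and the target features the second, so the per-category mean posteriors differ and `mmd_con` is strictly positive.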

In order to estimate the conditional distribution of the target samples, the labels of the target samples need to be given.
However, the target labels are unknown in unsupervised domain adaptation. In such a case, one can use base classifiers,
e.g., SVMs or MLPs, to obtain pseudo labels for the target samples. In this work, non-transfer neural networks are
used as the base classifier, and we iteratively update the labels of the target samples (the
procedure is shown in Section 3). This empirically provides good performance.

2.5. Final Objective Function 


By using the output of the last feature extraction layer as the input of the discrimination layer, we get DTN. The feature 
extraction layers find a shared subspace, in which the marginal distributions of features of the source and the target datasets 
are matched. Then an adaptive classifier is trained in the new shared feature space to match the conditional distributions. The 
final objective of DTN is given in this part. 

We use the negative log-likelihood, a commonly used loss function, in the objective function. It is defined as:

$$-\mathcal{L}(W) = -\sum_{i=1}^{n} \log\left(P(y = y_i \mid x_i; W)\right), \quad (7)$$

where $W$ is the projection matrix of the neural networks in Eq. 3. By integrating Eq. 4 and Eq. 6 into Eq. 7, we get the
final objective function of DTN:

$$J(W) = -\mathcal{L}(W) + \lambda\, \mathrm{Tr}(H(l-1)\, M\, H^T(l-1)) + \mu \sum_{c=1}^{C} q_c^T M q_c, \quad (8)$$

where $\lambda$ and $\mu$ are the regularization parameters for the marginal and the conditional distributions, which are further discussed in Section 4. Then,
$W$ can be calculated by minimizing Eq. 8. The final objective function does not depend on the exact form of the loss function
or the structure of the network: one can easily extend the formulation to other types of neural networks, such as CNNs and
Deep Belief Networks (DBNs), in addition to MLPs.


3. Optimization of Deep Transfer Network 
3.1. The Optimization Procedure 

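We first introduce some notation. For an arbitrary training sample $x$, denote the posterior (conditional) probability output
of the final classifier for this sample as $p$. $p \in \mathbb{R}^C$ is a $C$-dimensional vector, where each dimension represents the posterior
probability of a certain category. Writing $h_i^s$, $h_j^t$ for the features $h(l-1)$ and $p_i^s$, $p_j^t$ for the posterior vectors of the $i$-th source and the $j$-th target sample, it is easy to show that:

$$\nabla_{h(l-1)}(\mathrm{MMD}_{mar}) = \begin{cases} \frac{2}{n_s}\left(\frac{1}{n_s}\sum_{i=1}^{n_s} h_i^s - \frac{1}{n_t}\sum_{j=1}^{n_t} h_j^t\right), & x \in D^s \\ \frac{2}{n_t}\left(\frac{1}{n_t}\sum_{j=1}^{n_t} h_j^t - \frac{1}{n_s}\sum_{i=1}^{n_s} h_i^s\right), & x \in D^t \end{cases} \quad (9)$$

and

$$\nabla_{p}(\mathrm{MMD}_{con}) = \begin{cases} \frac{2}{n_s}\left(\frac{1}{n_s}\sum_{i=1}^{n_s} p_i^s - \frac{1}{n_t}\sum_{j=1}^{n_t} p_j^t\right), & x \in D^s \\ \frac{2}{n_t}\left(\frac{1}{n_t}\sum_{j=1}^{n_t} p_j^t - \frac{1}{n_s}\sum_{i=1}^{n_s} p_i^s\right), & x \in D^t \end{cases} \quad (10)$$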
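We then use the backpropagation algorithm to optimize the neural network. Denote one element of $W$ as $W_{ij}$. The partial
derivative of Eq. 8 with respect to $W_{ij}$ can be written as:

$$\frac{\partial J}{\partial W_{ij}} = -\frac{\partial \mathcal{L}}{\partial W_{ij}} + \lambda \sum_{x} \nabla_{h(l-1)}(\mathrm{MMD}_{mar})^T \frac{\partial h(l-1)}{\partial W_{ij}} + \mu \sum_{x} \nabla_{p}(\mathrm{MMD}_{con})^T \frac{\partial p}{\partial W_{ij}}, \quad (11)$$

where the sums run over all training samples $x$, and $\partial h(l-1) / \partial W_{ij}$ and $\partial p / \partial W_{ij}$ are vectors consisting of the partial derivatives of each element of $h(l-1)$ and $p$ with respect to
$W_{ij}$. They can be easily computed once the structure of the network is given. We can optimize the parameters of the network
with stochastic gradient descent.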
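A quick way to gain confidence in a gradient such as the one used for the marginal MMD is a finite-difference check. The sketch below assumes the MMD is the squared distance between the source and target feature means, in which case the gradient with respect to a single source feature vector is $\frac{2}{n_s}$ times the mean difference; the helper names and the data are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]


def mmd(h_source, h_target):
    """Squared distance between the source and target feature means."""
    m_s, m_t = mean_vector(h_source), mean_vector(h_target)
    return sum((a - b) ** 2 for a, b in zip(m_s, m_t))


def grad_wrt_source_sample(h_source, h_target):
    """Analytic gradient of the MMD with respect to any single source feature vector."""
    m_s, m_t = mean_vector(h_source), mean_vector(h_target)
    return [2.0 / len(h_source) * (a - b) for a, b in zip(m_s, m_t)]


def numeric_grad(h_source, h_target, index, eps=1e-6):
    """Central finite differences with respect to h_source[index]."""
    grad = []
    for k in range(len(h_source[0])):
        plus = [list(h) for h in h_source]
        minus = [list(h) for h in h_source]
        plus[index][k] += eps
        minus[index][k] -= eps
        grad.append((mmd(plus, h_target) - mmd(minus, h_target)) / (2 * eps))
    return grad


# Hypothetical hidden features: 3 source and 2 target samples, 2 hidden units.
h_s = [[0.2, -0.4], [0.9, 0.1], [-0.3, 0.5]]
h_t = [[0.6, 0.6], [0.0, -0.2]]
analytic = grad_wrt_source_sample(h_s, h_t)
numeric = numeric_grad(h_s, h_t, index=1)
```

The analytic gradient is the same for every source sample and depends only on the two means, so it can be computed once per batch and reused for all samples in it.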

One issue to be addressed is that computing the marginal MMD or the conditional MMD requires taking all
the source and the target samples into consideration. This is very inefficient, especially when the training dataset is large.
Inspired by the idea of mini-batch stochastic gradient descent, we divide all the samples into different batches. Let $N$ be the
number of batches we want to build. The source and the target datasets are divided into $N$ parts. A training batch consists
of $n_s/N$ source samples and $n_t/N$ target samples. It is easy to see that the MMD over the whole datasets is governed by the MMDs over the individual batches.
Here $B$ is the set of all the batches, $B_k$ is the $k$-th batch in $B$, and $n_s^k = n_s/N$ and $n_t^k = n_t/N$ are the numbers of source and
target samples in $B_k$. Therefore, if all the mini-batches are matched very well, the whole datasets are matched. Instead of
computing MMD over the whole datasets, we compute MMD over every single batch.
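One way to see why matching every mini-batch also matches the whole datasets is sketched below. It assumes that the $N$ batches partition the data and that each batch holds $n_s/N$ source and $n_t/N$ target samples; the symbols $\Delta$ and $\Delta_k$ are introduced here only for illustration. The mean difference over the whole datasets is then the average of the per-batch mean differences, and convexity of the squared norm bounds the global discrepancy by the average batch discrepancy.

```latex
\Delta
  = \frac{1}{n_s}\sum_{i=1}^{n_s} h_i^s - \frac{1}{n_t}\sum_{j=1}^{n_t} h_j^t
  = \frac{1}{N}\sum_{k=1}^{N} \Delta_k ,
\qquad
\Delta_k
  = \frac{1}{n_s^k}\sum_{h^s \in B_k} h^s - \frac{1}{n_t^k}\sum_{h^t \in B_k} h^t ,

\|\Delta\|^2
  = \Big\| \frac{1}{N}\sum_{k=1}^{N} \Delta_k \Big\|^2
  \le \frac{1}{N}\sum_{k=1}^{N} \|\Delta_k\|^2 .
```

If every $\|\Delta_k\|^2$ is driven close to zero, the bound forces $\|\Delta\|^2$ close to zero as well.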






Figure 2. Method of building the batches of data (labels in the figure: One Batch, Copied Data). The samples in $D^t$ are copied (dashed rectangles) so that the two datasets have the same
size. The training batches are then built by randomly picking samples from the two datasets.


Algorithm 1 Optimization of the Deep Transfer Network
Input: Source data with labels $X^s$, $Y^s$ and target data $X^t$.

Output: Parameters of the deep transfer network $W$ and predicted labels of the target samples $Y^t$.

1: begin:

2: Set $i = 0$. Get $Y_0^t$ by training a baseline neural network with $X^s$, $Y^s$ and testing with $X^t$.

3: repeat

4: $i = i + 1$.

5: Make mini-batches.

6: Get $W$ by optimizing Eq. 8 using source data $X^s$, $Y^s$ and target data $X^t$, $Y_{i-1}^t$.

7: Predict $X^t$ with the network to get $Y_i^t$.

8: until $Y_i^t = Y_{i-1}^t$ or $i > T$.

9: return $W$, $Y^t = Y_i^t$.


The training batches are built as follows. First, samples of the smaller dataset are randomly picked and copied such
that the source and the target datasets have the same number of samples. Assume the size of a single batch is $S$. Then batches
are filled by randomly picking $S/2$ samples from the source dataset and another $S/2$ samples from the target dataset, until all
the samples are selected. The batch size $S$ should be big enough so that every mini-batch can reflect the variance of the whole
dataset. An empirical analysis of $S$ is given in Section 4. Figure 2 illustrates the process of building the batches of data.
Finally, gradient descent on the mini-batch is applied to optimize the objective function.
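The batch construction just described can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the function name `build_batches`, the fixed random seed and the tuple-based toy samples are all hypothetical.

```python
import random


def build_batches(source, target, batch_size, seed=0):
    """Pair S/2 source and S/2 target samples per batch after equalising the dataset sizes."""
    rng = random.Random(seed)
    source, target = list(source), list(target)

    # Copy randomly picked samples of the smaller dataset until both sizes match.
    smaller, larger = (source, target) if len(source) < len(target) else (target, source)
    originals = list(smaller)
    smaller.extend(rng.choice(originals) for _ in range(len(larger) - len(smaller)))

    # Shuffle both datasets, then cut them into aligned halves of size S/2.
    rng.shuffle(source)
    rng.shuffle(target)
    half = batch_size // 2
    batches = []
    for start in range(0, len(source), half):
        batches.append((source[start:start + half], target[start:start + half]))
    return batches


# Hypothetical toy datasets: 10 source samples and 4 target samples.
source = [("s", i) for i in range(10)]
target = [("t", i) for i in range(4)]
batches = build_batches(source, target, batch_size=4)
```

With these sizes the target set is padded from 4 to 10 samples by copying, and five batches of two source and two target samples each are produced.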

As discussed in Section 2, the outputs of a non-transfer neural network are used as the pseudo labels of the target
samples. It is not surprising that the more accurate the labels of the target data are, the better the
final classifier performs. In particular, DTN can use its output as input to improve itself. We find that iteratively updating the labels of
the target samples during training can significantly improve the performance of DTN, especially on large-scale datasets. We
analyze the convergence property of DTN in Section 4. Algorithm 1 summarises the optimization of DTN.
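The outer loop of this procedure can be sketched independently of the network itself. In the code below the three callables are placeholders: `train_baseline`, `train_dtn` and `predict` are hypothetical names, and the toy one-dimensional threshold classifier only stands in for a neural network so that the loop can be run end to end. It does not implement the DTN objective.

```python
def iterate_pseudo_labels(train_baseline, train_dtn, predict, X_s, Y_s, X_t, max_iters):
    """Alternate between fitting with the current pseudo labels and relabelling the target set."""
    model = train_baseline(X_s, Y_s)
    labels = predict(model, X_t)              # initial pseudo labels from the non-transfer model
    for _ in range(max_iters):                # at most max_iters relabelling rounds
        model = train_dtn(X_s, Y_s, X_t, labels)
        new_labels = predict(model, X_t)
        if new_labels == labels:              # pseudo labels stopped changing
            break
        labels = new_labels
    return model, labels


def fit_threshold(xs, ys):
    """Toy 1-D classifier: midpoint between the two class means (stand-in for a network)."""
    mean0 = sum(x for x, y in zip(xs, ys) if y == 0) / ys.count(0)
    mean1 = sum(x for x, y in zip(xs, ys) if y == 1) / ys.count(1)
    return (mean0 + mean1) / 2.0


def predict(threshold, xs):
    """Label a point 1 if it lies above the threshold, else 0."""
    return [1 if x > threshold else 0 for x in xs]


# Hypothetical data: the target domain is a shifted version of the source domain.
X_s, Y_s = [0.0, 1.0, 4.0, 5.0], [0, 0, 1, 1]
X_t = [2.0, 2.6, 3.0, 6.0, 6.5, 7.0]
model, Y_t = iterate_pseudo_labels(
    fit_threshold,
    lambda xs, ys, xt, yt: fit_threshold(xs + xt, ys + yt),  # stand-in for the transfer step
    predict, X_s, Y_s, X_t, max_iters=20)
```

In this toy run the initial pseudo labels place two low-valued target points in the wrong class, and the labels change twice before they stabilise, which mirrors the stopping rule of the algorithm.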


3.2. Computational Complexity 

Let $n = \max(n_s, n_t)$ be the size of the bigger dataset, and let the number of batches in $B$ be $N$. For each mini-batch, the
costs of computing Eq. 9 and Eq. 10 are both $O(S)$, where $S$ is the size of one batch. More samples in a single batch always
lead to a higher computational cost when computing the gradient over the batch. The computational complexity of forward
propagation over one batch is also $O(S)$. Since there are a total of $N$ batches and $S \times N = 2n$, the total computational
complexity of the backpropagation in DTN over the whole datasets is $O(n)$. Unlike higher-order domain adaptation
algorithms, such as Transfer Joint Matching ($O(n^2)$) [22] and Adaptation Regularization based Regularized Least Squares
($O(n^3)$) [20], the execution time of DTN grows linearly with the number of input samples. Thus, DTN is suitable
for large-scale datasets.


4. Experiments 

In this section, we conduct comprehensive experiments to evaluate the performance of DTN. Besides standard domain
adaptation datasets widely used in the existing literature, we also develop two new settings which have 10 times more samples
than existing datasets.



















Table 1. Classification Accuracies (%) on Office-Caltech Dataset

                                              Database
Method   A/W    A/D    A/C    W/A    W/D    W/C    D/A    D/W    D/C    C/A    C/W    C/D    Avg
NN       29.83  25.48  26.00  22.96  59.24  19.86  28.50  63.39  26.27  23.70  25.76  25.48  31.37
FSSL     34.35  26.37  33.91  29.53  76.79  25.85  30.61  74.99  27.89  35.88  32.32  37.53  38.84
TCA      35.25  34.39  40.07  28.81  85.99  29.92  31.42  86.44  32.06  45.82  30.51  35.67  43.03
GFK      41.69  37.58  37.58  27.77  82.17  30.10  33.61  79.66  30.45  41.54  35.93  43.31  43.45
TJM      42.03  45.22  39.45  29.96  89.17  30.19  32.78  85.42  31.43  46.76  38.98  44.59  46.33
RLS      38.98  38.85  42.20  38.62  78.34  34.46  34.13  79.66  33.21  53.86  49.49  44.59  47.20
ARRLS    38.98  38.85  42.48  39.87  78.34  34.73  37.16  82.71  32.24  54.91  50.51  44.59  47.95
MLP      36.80  46.00  41.45  33.26  70.00  32.91  31.37  74.40  29.64  51.05  48.80  45.33  45.08
DTN      43.00  56.00  42.90  36.89  84.00  34.18  34.89  87.50  32.27  54.00  58.50  56.00  51.68


4.1. Domain Adaptation on Small-Scale Datasets 

We use the publicly available Office-Caltech dataset¹ to evaluate domain adaptation algorithms on a small-scale dataset.
The Office-Caltech dataset, first released by Gong et al. [12], consists of the Office dataset and the Caltech-256 dataset. The Office
dataset contains three domains: Amazon (online merchant images downloaded from www.amazon.com), DSLR (images of
the corresponding merchandise captured by a DSLR camera in realistic environments) and Webcam (images captured by a webcam).
The Caltech-256 dataset has 30,607 images in 256 categories. The four domains (Amazon (A), Webcam (W), DSLR (D),
and Caltech (C)) share 10 object categories in total. In the corresponding categories they have 958, 295, 157 and 1,123
image samples respectively, with a total of 2,533 images. For all the images, SURF features are extracted and quantized
into an 800-bin histogram with a codebook trained from a subset of images in Amazon. The dataset is a standard benchmark
for evaluating domain adaptation algorithms. By selecting two different domains out of the four domains to form the
source and the target datasets, we obtain 12 source/target cross-domain pairs, denoted A/W, A/D, A/C, ..., C/D.

We report results of DTN on all the 12 pairs and compare them to the methods reported in [20] and [22]. The methods include:

• 1-Nearest Neighbor Classifier (NN)

• Joint Feature Selection and Subspace Learning (FSSL) [13] + NN

• Transfer Component Analysis (TCA) [24] + NN

• Geodesic Flow Kernel (GFK) [12] + NN

• Transfer Joint Matching (TJM) [22] + NN

• Regularized Least Squares (RLS)

• Adaptation Regularization based Regularized Least Squares (ARRLS) [20]

• Multiple Layer Perceptrons (MLP)

Among those, NN, MLP and RLS are the base classifiers without knowledge transfer. FSSL, TCA, GFK and TJM are subspace
learning methods, while ARRLS is a classifier transduction method. In particular, MLP is the non-transfer base classifier of
DTN, and ARRLS is the method most closely related to DTN, since they both use the marginal MMD and the conditional MMD
to evaluate the discrepancy between distributions. However, their underlying classification models are different. DTN is
implemented using Theano, a parallel Python package [2].

We follow the same protocol as in [19], [22] in the evaluation. The performance of DTN depends on the architecture of the
neural networks. Besides the architecture of the neural networks, DTN involves four meta parameters: the regularization
parameter for marginal distribution matching $\lambda$, the regularization parameter for conditional distribution matching $\mu$, the
batch size $S$ which determines the size of the mini-batch, and the iteration number $T$ controlling how many times the labels
of the target samples are updated during the training process. For small-scale datasets, we only use the labels produced by the non-transfer
MLP during training.

In this experiment, we only use a neural network with one hidden layer. In other words, the base neural network has only
one feature extraction layer and one discrimination layer. The number of hidden nodes is 500. We use the default setting
$\lambda = \mu = 10$ throughout the whole dataset. We empirically set the batch size $S = 200$, which means that every mini-batch
consists of 100 source samples and 100 target samples. However, since samples in the DSLR and Webcam domains are too few
to build enough batches for training, we set $S = 100$ for all the source/target pairs which involve the DSLR and/or Webcam
domains. The classification accuracy is used as the evaluation metric, as in [19], [22].

1 From https://www.eecs.berkeley.edu/~jhoffman/domainadapt/

Table 2. Classification Accuracies (%) on Large-Scale Benchmark Datasets

                              Method
Database     GFK    RLS    ARRLS  MLP (CNN)  DTN (ours)
USPS/MNIST   33.71  16.57  52.09  44.47      81.04
CIFAR/VOC    49.15  62.43  71.73  59.04      73.60

Table 1 shows the classification accuracies on all the source/target pairs in the Office-Caltech dataset for DTN
and all the eight baseline methods. We observe that DTN achieves the best performance on 6 out of 12 datasets, and on
three of the remaining datasets (W/C, D/C and C/A), the performance of DTN is only slightly worse (less than 1% of the classification accuracy) than the
best baseline method. The average classification accuracy of DTN on all the 12 datasets is 51.68%, which is 3.73% higher
than that of the best baseline method, ARRLS. Compared to its non-transfer base classifier MLP, DTN improves on
all 12 datasets and outperforms MLP by 6.60% in average accuracy. This indicates that simultaneously minimizing
the MMD of both the marginal and the conditional distributions of the source and the target datasets can significantly improve
the performance of the trained classifier on the unlabeled target dataset.
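The summary figures quoted above can be recomputed directly from the rows of Table 1. The short script below does so; the row values are copied from the table, while the variable names are chosen only for this illustration.

```python
# Rows of Table 1 in column order A/W, A/D, A/C, W/A, W/D, W/C, D/A, D/W, D/C, C/A, C/W, C/D.
accuracies = {
    "NN":    [29.83, 25.48, 26.00, 22.96, 59.24, 19.86, 28.50, 63.39, 26.27, 23.70, 25.76, 25.48],
    "FSSL":  [34.35, 26.37, 33.91, 29.53, 76.79, 25.85, 30.61, 74.99, 27.89, 35.88, 32.32, 37.53],
    "TCA":   [35.25, 34.39, 40.07, 28.81, 85.99, 29.92, 31.42, 86.44, 32.06, 45.82, 30.51, 35.67],
    "GFK":   [41.69, 37.58, 37.58, 27.77, 82.17, 30.10, 33.61, 79.66, 30.45, 41.54, 35.93, 43.31],
    "TJM":   [42.03, 45.22, 39.45, 29.96, 89.17, 30.19, 32.78, 85.42, 31.43, 46.76, 38.98, 44.59],
    "RLS":   [38.98, 38.85, 42.20, 38.62, 78.34, 34.46, 34.13, 79.66, 33.21, 53.86, 49.49, 44.59],
    "ARRLS": [38.98, 38.85, 42.48, 39.87, 78.34, 34.73, 37.16, 82.71, 32.24, 54.91, 50.51, 44.59],
    "MLP":   [36.80, 46.00, 41.45, 33.26, 70.00, 32.91, 31.37, 74.40, 29.64, 51.05, 48.80, 45.33],
    "DTN":   [43.00, 56.00, 42.90, 36.89, 84.00, 34.18, 34.89, 87.50, 32.27, 54.00, 58.50, 56.00],
}

# Average accuracy per method, rounded to two decimals as in the table.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in accuracies.items()}
gain_over_arrls = round(averages["DTN"] - averages["ARRLS"], 2)
gain_over_mlp = round(averages["DTN"] - averages["MLP"], 2)

# Number of source/target pairs on which DTN is strictly better than every baseline.
baselines = [name for name in accuracies if name != "DTN"]
dtn_wins = sum(
    all(accuracies["DTN"][k] > accuracies[b][k] for b in baselines)
    for k in range(12)
)

# Number of pairs on which DTN improves over its non-transfer base classifier.
better_than_mlp = sum(d > m for d, m in zip(accuracies["DTN"], accuracies["MLP"]))
```

The gains are differences between accuracies expressed in percent, so they are best read as percentage points.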

We also notice that DTN performs better than ARRLS. These two methods both use MMD to evaluate the divergences of the marginal and
the conditional distributions between two datasets. The only difference lies in the base classifiers: DTN uses
MLP and ARRLS uses RLS. This suggests that the combination of the best paradigms of both worlds (neural networks in
recognition, and matching both the marginal and the conditional distributions in domain adaptation) works well. We also
observe that methods matching both the marginal and the conditional distributions (TJM, ARRLS and DTN) perform
better on average than methods matching marginal distributions only (TCA and GFK). The reason is that both the marginal and the
conditional distributions may change over different domains in real-world problems. Matching marginal distributions only
cannot guarantee a small discrepancy between conditional distributions, and the discriminative directions of the source and the
target domains may still be different [20].

4.2. Domain Adaptation on Large-Scale Datasets 

We develop two new large-scale settings to evaluate the performance of domain adaptation algorithms: the USPS/MNIST
dataset and the CIFAR/VOC dataset. These two datasets are almost 10 times larger than current datasets. For both settings,
we use one dataset as the source set and the other one as the target set to evaluate our method.

USPS/MNIST. The USPS² and MNIST³ datasets are two handwriting datasets widely used in the evaluation of classification
algorithms. The USPS dataset consists of 10,000 training and testing images with an image size of 16x16, while the MNIST dataset has
50,000 training images with an image size of 28x28. We follow the same preprocessing procedure as in [19]. However, rather
than randomly sampling a subset, we use all the training and testing images in the USPS dataset as the source samples and
all the training samples in the MNIST dataset as the target samples. Thus, our USPS/MNIST dataset has 60,000 samples in
total, which is 20 times larger than the one used in ED.

CIFAR/VOC. The CIFAR-10 dataset⁴ [16] is a labeled subset of the 80 million tiny images dataset [28]. It consists of 60,000
32x32 colour images in 10 categories, 6,000 images per class. The Pascal VOC 2012 dataset⁵ is designed for recognizing
objects in realistic scenes. It has 20 classes with a total of 11,530 images. Some sample images of CIFAR-10 and Pascal
VOC 2012 are shown in Figure 3. The CIFAR-10 dataset consists of very tiny images, while the images in Pascal VOC 2012
look like images taken from the Internet. They follow very different distributions. The CIFAR-10 dataset and the Pascal VOC 2012
dataset share 6 semantic categories: “aeroplane/airplane”, “bird”, “car”, “cat”, “dog” and “horse”. We therefore construct the
CIFAR/VOC dataset by randomly sampling 15,000 images (2,500 per category) from the CIFAR-10 dataset to form the source
dataset and selecting all the samples of these categories in the Pascal VOC 2012 dataset to form the target dataset. For all the 17,720 image samples,
the DeCAF feature [5] is extracted. We use the 6th layer output, which leads to a 4096-dimensional feature.

A convolutional neural network (CNN) is applied to the USPS/MNIST dataset as the base classifier. The CNN is based
on the LeNet model, which was first proposed in [18]. It has two convolutional and max-pooling layers with filter size 3x3
and a fully-connected layer with 500 nodes. A multilayer perceptron with 3 hidden layers is used as the base classifier for
the CIFAR/VOC dataset. The numbers of nodes in the hidden layers are 2,000, 1,000 and 500 respectively. For parameter
settings, we again set the distribution regularization parameters to the default setting $\lambda = \mu = 10$. The mini-batch size $S$ for
USPS/MNIST is set to 4,000, which means that one mini-batch consists of 2,000 source samples and 2,000 target samples.
For CIFAR/VOC, $S = 2,000$. The iteration number $T$ is set to 20, meaning that the labels of the target samples are iteratively
updated 20 times during training.

2 From http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#usps

3 From http://yann.lecun.com/exdb/mnist/

4 From http://www.cs.toronto.edu/~kriz/cifar.html

5 From http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2012/

Figure 3. Sample images in CIFAR/VOC dataset. (a) Sample images in CIFAR-10 dataset. (b) Sample images in Pascal VOC 2012 dataset.

Figure 4. Parameters analysis on USPS/MNIST and CIFAR/VOC datasets and execution time on various sizes USPS/MNIST datasets.

Table 3. Execution Time (s) on Large-Scale Benchmark Datasets

                      Method
Database     GFK    ARRLS  DTN (ours)
CIFAR/VOC    153    640    1,428
USPS/MNIST   34     7,346  4,548

Table 2 shows the results of DTN and some baseline methods. We observe that the classification accuracies of DTN on the
USPS/MNIST and CIFAR/VOC datasets are 81.04% and 73.60% respectively. Compared to its non-transfer base classifier
neural networks, it achieves improvements of 36.57% and 14.56%. The best baseline method is ARRLS, which reaches accuracies
of 52.09% and 71.73% and gains 35.52% and 9.30% over its non-transfer base classifier RLS. Compared
to the best baseline method (ARRLS), DTN gains improvements of 28.95% and 1.87%. We observe that DTN performs
significantly better than any other baseline method on the USPS/MNIST dataset. This is because of the extraordinary performance
of its base classifier, the CNN, in digit recognition.

Table 3 shows the execution time of some algorithms run on the large-scale datasets. DTN has competitive speed compared
to the best-performing baseline method (ARRLS). In order to directly display the computational complexity, we
randomly sample USPS/MNIST to different sizes, from 10,000 to 60,000, and plot the execution time of DTN and ARRLS
in Figure 4(d). It shows that the execution time of DTN grows almost linearly with the number of training samples, whereas that of
ARRLS grows much faster. It should also be remarked that ARRLS requires over 100GB of memory to store the kernel
matrix, while less than 3GB of video memory is enough for DTN. The experimental results show that DTN can deal with very
large-scale datasets with appealing transfer performance.


4.3. Parameter Sensitivity 


We conduct parameter sensitivity experiments on the USPS/MNIST and CIFAR/VOC datasets. The distribution matching parameters
$\lambda$ and $\mu$, the size of the mini-batch $S$ and the iteration number $T$ are evaluated.

Distribution Regularization Parameters $\lambda$, $\mu$. $\lambda$ controls the level of marginal distribution matching, and $\mu$ controls
the level of conditional distribution matching. The larger the values are, the smaller the differences between the distributions.
For simplicity, we set $\lambda = \mu$. Figure 4(a) shows the classification accuracies with different values taken from
$\{0, 1, 5, 10, 50, 100, 500\}$. We notice that $\lambda$ and $\mu$ can be chosen from $[1, 100]$. Throughout the experiments, we
choose $\lambda = \mu = 10$ as the default.

Batch Size $S$. The mini-batch is the basic unit for evaluating the distribution of the dataset and optimizing the objective
function. The size $S$ should be large enough for the batch to contain enough samples to reflect the distribution of
the whole dataset. Figure 4(b) shows the classification accuracies with different values taken from $\{200, 400, 600, 800, 1000,
2000, 4000, 8000\}$. We notice that a bigger batch always leads to better performance. For large-scale datasets, $S$ can be chosen
from $[2000, 8000]$. On the other hand, the size of the batch should not be too large, due to limited GPU memory
(experiments on CIFAR/VOC with batch sizes 4,000 and 8,000 failed due to insufficient video memory). Finally, we choose
$S = 4,000$ for USPS/MNIST and $S = 2,000$ for CIFAR/VOC.

Iteration Number $T$. DTN can use its output as its input and iteratively improve its performance. $T$ determines how
many times the labels of the target datasets are updated during training. Figure 4(c) shows that the classification accuracy
increases steadily with more iterations and finally converges (the classification accuracies of USPS/MNIST are plotted every
4 iterations). Finally, we choose $T = 20$ for experiments on large-scale datasets.

The unsupervised DTN can also be easily extended to a semi-supervised version. Labeled samples in the target dataset can
provide a better estimate of the distribution of the target data.












5. Conclusion 


We proposed the Deep Transfer Network (DTN), which combines the best paradigms in object recognition (neural networks)
and domain adaptation (matching both the marginal and the conditional distributions). The structure of neural networks
enables efficient modeling and optimization of the distribution matching process.
DTN has a computational complexity of $O(n)$, making it suitable for large-scale domain adaptation problems.
Comprehensive experiments showed that DTN is effective on a variety of benchmark datasets and that it significantly outperforms
competitive methods, especially on large-scale problems.